Redefining CpG islands using hidden Markov models.

Authors

  • Hao Wu
  • Brian Caffo
  • Harris A Jaffee
  • Rafael A Irizarry
  • Andrew P Feinberg
Abstract

The DNA of most vertebrates is depleted in CpG dinucleotides: a C followed by a G in the 5' to 3' direction. CpGs are the target of DNA methylation, a chemical modification of cytosine (C) that is heritable during cell division and is the best-characterized epigenetic mechanism. The remaining CpGs tend to cluster in regions referred to as CpG islands (CGI). Knowing CGI locations is important because they mark functionally relevant epigenetic loci in development and disease. For various mammals, including human, a widely used list of CGI is readily available from the UCSC Genome Browser. This list was derived using algorithms that search for regions satisfying a definition of CGI proposed by Gardiner-Garden and Frommer more than 20 years ago. Recent findings, enabled by advances in technology that permit direct measurement of epigenetic endpoints at a whole-genome scale, motivate the need to adapt the current CGI definition. In this paper, we propose a procedure, guided by hidden Markov models, that permits an extensible approach to detecting CGI. The main advantage of our approach over others is that it summarizes the evidence for CGI status as probability scores. This provides flexibility in the definition of a CGI and facilitates the creation of CGI lists for other species. The utility of this approach is demonstrated by generating the first CGI lists for invertebrates and by creating CGI lists that substantially increase overlap with recently discovered epigenetic marks. A CGI list and the probability scores, as a function of genome location, for each species are available at http://www.rafalab.org.
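
To make the idea of "probability scores" for CGI status concrete, here is a minimal sketch of posterior decoding in a two-state hidden Markov model (background vs. island) using the forward-backward algorithm. It is a toy illustration, not the authors' model: the single-nucleotide emissions, the state names `B` and `I`, and every number in `TOY_TRANS`, `TOY_EMIT`, and `TOY_INIT` are assumptions made up for this example.

```python
def posterior_island(seq, trans, emit, init, island="I"):
    """Return P(state == island | seq) for each position (forward-backward).

    Assumes seq contains only the uppercase letters A, C, G, T.
    """
    states = list(init)
    n = len(seq)
    if n == 0:
        return []

    # Forward pass; each row is rescaled to sum to 1 to avoid underflow.
    fwd, scale = [], []
    for t, ch in enumerate(seq):
        if t == 0:
            row = {s: init[s] * emit[s][ch] for s in states}
        else:
            row = {
                s: emit[s][ch] * sum(fwd[-1][r] * trans[r][s] for r in states)
                for s in states
            }
        c = sum(row.values())
        fwd.append({s: row[s] / c for s in states})
        scale.append(c)

    # Backward pass, reusing the forward scaling factors.
    bwd = [None] * n
    bwd[-1] = {s: 1.0 for s in states}
    for t in range(n - 2, -1, -1):
        ch = seq[t + 1]
        bwd[t] = {
            s: sum(trans[s][r] * emit[r][ch] * bwd[t + 1][r] for r in states)
            / scale[t + 1]
            for s in states
        }

    # Posterior: normalized product of forward and backward terms.
    post = []
    for t in range(n):
        z = sum(fwd[t][s] * bwd[t][s] for s in states)
        post.append(fwd[t][island] * bwd[t][island] / z)
    return post


# Hypothetical toy parameters (not estimated from any genome).
TOY_INIT = {"B": 0.5, "I": 0.5}
TOY_TRANS = {"B": {"B": 0.9, "I": 0.1}, "I": {"B": 0.1, "I": 0.9}}
TOY_EMIT = {
    "B": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},  # background: AT-rich
    "I": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},  # island: CG-rich
}
TOY_SEQ = "ATATATAT" + "CGCGCGCGCG" + "ATATATAT"

post = posterior_island(TOY_SEQ, TOY_TRANS, TOY_EMIT, TOY_INIT)
print([round(p, 2) for p in post])
```

Thresholding such per-position scores at different cutoffs yields stricter or looser island lists, which is the kind of flexibility the abstract attributes to a probabilistic summary.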


Similar resources

Predicting CpG Islands and Their Relationship with Genomic Feature in Cattle by Hidden Markov Model Algorithm

Cattle supply an important source of nutrition for humans worldwide. CpG islands (CGIs) are very important and useful, as they carry functionally relevant epigenetic loci for whole genome studies. There have been no formal analyses of CGIs at the DNA sequence level in cattle genomes, and this study was carried out to fill that gap. We used hidden Markov model alg...


Sequential Modeling for Identifying Gene Locations in Human Genome

We consider several sequential processing algorithms for identifying genes in human DNA, based on detecting CpG islands. The algorithms are designed to capture the underlying statistical structure in a DNA sequence. Sequential processing using a Markov model and a hidden Markov model are shown to identify most CpG islands in annotated (marked) DNA subsequences in publicly available DNA data set...


CpG Island Finding Using Graphical Models

CpG islands are short stretches of DNA sequence in which the frequency of cytosine (C) and guanine (G) is higher than in the background DNA sequence. They are found around the promoters of frequently expressed genes. The conventional way to recognize CpG islands is to use hidden Markov models (HMMs). While HMMs are known to suffer from not being able to capture long dynamic range information, they usually does...


Hidden Markov Models 3 6

6.1.1 Preface: CpG islands. It is known that, due to biochemical considerations, CpG (the pair of nucleotides C and G appearing successively, in this order, along one DNA strand) is relatively rare in DNA sequences, except in particular sub-sequences, several hundred nucleotides long, where the CpG pair is more frequent. These sub-sequences, called CpG islands, are known to a...


Computational Biology Lecture 9: CpG islands, Markov Chains, Hidden Markov Models HMMs

Given a DNA or an amino acid sequence, biologists would like to know what the sequence represents. For instance, is a particular DNA sequence a gene or not? Another example would be to identify which family of proteins a given protein (amino acid sequence) belongs to. In both cases above, we have a sequence of symbols from some alphabet and we are required to say something about the structure o...



Journal title:
  • Biostatistics

Volume 11, Issue 3

Pages: -

Publication date: 2010